Programming with R {dplyr} - As I Understand It!!

Rstats
dplyr
functions

How to create your own functions using {dplyr}

Vishal Katti
07-17-2021

Introduction

The purpose of this document is to act as a quick guide for myself and others to understand how to use dplyr effectively to create dynamic functions. The general assumption is that the reader is familiar with the {dplyr} package and how to use it for data wrangling.

In this document, we will explore how to create functions using the popular dplyr verbs like select, filter, mutate, arrange and finally group_by with summarise.

Inspiration

I regularly deal with event-related information with event date and few other columns like event type, root cause etc. Most reports usually involve calculating number of events that took place on a monthly, quarterly or annual basis, sometimes split by event type, root cause and other columns. After a few reports I realized that I am basically writing the same code over and over again to calculate these KPIs. Keeping the DRY (Don't Repeat Yourself) principle in mind, I managed to write a few functions to calculate these KPIs with a few dynamic variables. Following is an attempt to articulate what I learnt while creating those functions.

Data

We shall use the Texas Housing Sales data, available as a tibble in the popular ggplot2 package as reference data. It contains monthly information about the housing market in Texas provided by the TAMU real estate center, https://www.recenter.tamu.edu/. It has 8602 observations and 9 variables.

txhousing <- ggplot2::txhousing
dplyr::glimpse(txhousing)
Rows: 8,602
Columns: 9
$ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abile~
$ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20~
$ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,~
$ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100~
$ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 1391~
$ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 7~
$ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, ~
$ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, ~
$ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, ~

We shall refer the above data in all the following sections.

select()

When using dplyr functions, the two most popular ways to pass column names is either as bare names i.e. column names without enclosing them in quotes like sales or volume OR pass them as a character string like “sales” or ‘volume’. You could also pass a character vector like c("sales", "volume"). In this section we will explore the 3 ways to dynamically select the columns we want.

Passing raw column names

In this method, we pass the raw name of the column we want to select and use the embrace of curly-curly brackets to pass the raw name. For multiple columns, we can pass the raw names as a single vector.

select_raw <- function(df, var) {
  dplyr::select(.data = df, {{var}}) %>%     # embrace of curly-curly {{}} brackets
    head()                                   # to limit the number of output rows in this example.
}
select_raw(txhousing, sales)                 # pass single raw name
# A tibble: 6 x 1
  sales
  <dbl>
1    72
2    98
3   130
4    98
5   141
6   156
select_raw(txhousing, c(sales, volume))      # pass a vector of raw names for multiple columns
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

If passing multiple raw names as vector as in the select_raw() feels like an unnecessary complication, try the next method.

Passing multiple raw column names using … argument

In this method, we use the . argument to pass the raw names of the columns we want to select.

my_select <- function(df, ...) {
  dplyr::select(.data = df, ...) %>% 
    head()
}

my_select(txhousing, sales, volume)          # pass multiple raw names directly
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

Passing a character vector of column names

If we have the column names as a character vector, we use the all_of function to pass the character vector to the internal select function.

my_select_char <- function(df, cols) {
  dplyr::select(.data = df, dplyr::all_of(cols)) %>% 
    head()
}

my_cols <- c("sales","volume")
my_select_char(txhousing, my_cols)
# A tibble: 6 x 2
  sales   volume
  <dbl>    <dbl>
1    72  5380000
2    98  6505000
3   130  9285000
4    98  9730000
5   141 10590000
6   156 13910000

filter()

In the previous section, we passed column names either as bare names or character strings. filter() takes one or more expressions/conditions that result in a logical vector, with same length as number of rows in the data.frame/tibble and returns only those rows for which the expression/condition returns TRUE. Following are 2 ways to pass these logical expressions/conditions. I’m using expression and condition interchangeably here. In this context, a condition is an expression that results in a boolean TRUE/FALSE result.

Passing single raw criteria

In this method, we pass the condition sales > 8000 as a raw/bare expression.

filter_raw <- function(df, cond) {
  dplyr::filter(.data = df, {{cond}})        # embrace of curly-curly {{}} brackets
}

filter_raw(txhousing, sales > 8000)          # Pass a single raw criterion
# A tibble: 10 x 9
   city     year month sales     volume median listings invent~1  date
   <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>    <dbl> <dbl>
 1 Houston  2006     5  8040 1602621368 151200    35398      5.5 2006.
 2 Houston  2006     6  8628 1795898108 155200    36281      5.6 2006.
 3 Houston  2013     5  8439 2121508529 186100    20526      3.3 2013.
 4 Houston  2013     7  8468 2168720825 187800    21497      3.3 2014.
 5 Houston  2013     8  8155 2083377894 186700    21366      3.3 2014.
 6 Houston  2014     6  8391 2342443127 211200    19725      2.9 2014.
 7 Houston  2014     7  8391 2278932511 199700    20214      3   2014.
 8 Houston  2014     8  8167 2195184825 202400    20007      2.9 2015.
 9 Houston  2015     6  8449 2490238594 222400    22311      3.2 2015.
10 Houston  2015     7  8945 2568156780 217600    23875      3.4 2016.
# ... with abbreviated variable name 1: inventory

Do you think we can pass multiple bare conditions as a vector, like we did for select_raw() in the previous section? Let us try passing multiple raw criteria as a vector.

filter_raw(txhousing, c(sales > 8000, year > 2010))
Error in `dplyr::filter()`:
! Problem while computing `..1 = c(sales > 8000, year >
  2010)`.
x Input `..1` must be of size 8602 or 1, not size 17204.

Passing multiple raw criteria as a vector doesn’t work like it works for select_raw() function. Let us understand why. Consider the following code:

A <- c(TRUE, TRUE)                           # boolean vector of length = 2
B <- c(FALSE, FALSE)                         # boolean vector of length = 2
X <- c(A, B)
X
[1]  TRUE  TRUE FALSE FALSE

Notice that length of X is 4. Similarly, sales > 8000 evaluates to a TRUE/FALSE boolean vector of length 8602 (equal to number of rows in txhousing) and so does year > 2010. So the vector c(sales > 8000, year > 2010) becomes a TRUE/FALSE boolean vector of length 17204, which results in an error.

Passing multiple raw criteria using … argument

To pass multiple raw criteria, we can use the ... argument.

my_filter <- function(df, ...) { 
  dplyr::filter(.data = df, ...)                # pass the dots argument
  }

my_filter(txhousing, sales > 8000, year > 2010) # pass multiple raw criteria
# A tibble: 8 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
2 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
3 Houston  2013     8  8155 2083377894 186700    21366       3.3 2014.
4 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.
5 Houston  2014     7  8391 2278932511 199700    20214       3   2014.
6 Houston  2014     8  8167 2195184825 202400    20007       2.9 2015.
7 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
8 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.

Passing single criteria as a character string

By default, dplyr::filter() does not accept conditions as character strings. Following is an example which results in error

dplyr::filter(txhousing, "sales > 8000")
Error in `dplyr::filter()`:
! Problem while computing `..1 = "sales > 8000"`.
x Input `..1` must be a logical vector, not a character.

We need to convert the character condition into a raw expression.

my_filter_string <- function(df, cond) {
  dplyr::filter(.data = df, eval(parse(text = cond)))   # convert text to raw criterion
}

my_filter_string(txhousing, "sales > 8000")             # pass single text string as criteria
# A tibble: 10 x 9
   city     year month sales     volume median listings invent~1  date
   <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>    <dbl> <dbl>
 1 Houston  2006     5  8040 1602621368 151200    35398      5.5 2006.
 2 Houston  2006     6  8628 1795898108 155200    36281      5.6 2006.
 3 Houston  2013     5  8439 2121508529 186100    20526      3.3 2013.
 4 Houston  2013     7  8468 2168720825 187800    21497      3.3 2014.
 5 Houston  2013     8  8155 2083377894 186700    21366      3.3 2014.
 6 Houston  2014     6  8391 2342443127 211200    19725      2.9 2014.
 7 Houston  2014     7  8391 2278932511 199700    20214      3   2014.
 8 Houston  2014     8  8167 2195184825 202400    20007      2.9 2015.
 9 Houston  2015     6  8449 2490238594 222400    22311      3.2 2015.
10 Houston  2015     7  8945 2568156780 217600    23875      3.4 2016.
# ... with abbreviated variable name 1: inventory

The special sauce here is the eval(parse(text = ...)) combo that converts the long text criteria into a single raw criteria and passes it to the internal filter() function.

Passing multiple criteria as character vector

What if want to pass multiple criteria as a string vector? In such a situation, we must combine all the string conditions into a single long string condition using paste0(..., collapse = " & "). The paste0("(", cond, ")", collapse = " & ") combines all the criteria into a single long criteria, but still a text string.

my_filter_strings <- function(df, cond) { 
  filter_text <- paste0("(", cond, ")", collapse = " & ")   # combine all criteria
  message("Filter Condition: ", filter_text)                # (OPTIONAL) show the combined filter string
  dplyr::filter(.data = df, eval(parse(text = filter_text)))# convert text to raw criterion
  }

my_filter_criteria <- c("sales > 8000", "year > 2010")
my_filter_strings(txhousing, my_filter_criteria)
# A tibble: 8 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
2 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
3 Houston  2013     8  8155 2083377894 186700    21366       3.3 2014.
4 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.
5 Houston  2014     7  8391 2278932511 199700    20214       3   2014.
6 Houston  2014     8  8167 2195184825 202400    20007       2.9 2015.
7 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
8 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
my_filter_criteria_with_OR <- c("sales > 8000 | sales < 50", "year > 2010")

# NOTE: OR criteria must be a single string separated by pipe '|' as in example below.
my_filter_strings(txhousing, my_filter_criteria_with_OR)
# A tibble: 315 x 9
   city         year month sales  volume median listings inven~1  date
   <chr>       <int> <int> <dbl>   <dbl>  <dbl>    <dbl>   <dbl> <dbl>
 1 Brownsville  2011     1    48 4974408  83300      784    12.6 2011 
 2 Brownsville  2011     2    47 5558575 101400      776    12.7 2011.
 3 Brownsville  2011     7    47 4807019  91200      749    13.1 2012.
 4 Brownsville  2011    12    39 4203440  86800      726    12.4 2012.
 5 Brownsville  2012     1    43 3892348  85000      791    13.6 2012 
 6 Brownsville  2012     3    27 2976148  93800      734    13.3 2012.
 7 Brownsville  2012    11    41 5115393  99000      807    14   2013.
 8 Brownsville  2013    11    38 4824930 108000      859    13.4 2014.
 9 Brownsville  2015     1    41 5400796  97000      733    10.7 2015 
10 Galveston    2011     1    43 8882961 170000     1015    13.7 2011 
# ... with 305 more rows, and abbreviated variable name 1: inventory

mutate()

mutate() allows you to add new columns or modify existing columns. In the example below, we will create a new column volume_in_millions from the existing column volume. The names of both the columns can be passed to the function either as raw names or character strings.

Passing the column name as raw name

mutate_raw <- function(df, new_col_raw, old_col_raw, num = 1) { 
  dplyr::mutate(.data = df, {{new_col_raw}} := {{old_col_raw}}/num) %>% 
    head()
}

txhousing %>% 
  select(city, year, month, volume) %>% 
  mutate_raw(vol_in_millions, volume, 1E6) # pass raw column names w/o quotes
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

Passing the new variable name as character string (direct)

mutate_text <- function(df, new_col_str, old_col_str, num = 1) { 
  dplyr::mutate(.data = df, {{new_col_str}} := df[[old_col_str]]/num) %>% 
    head()
}

txhousing %>% 
  select(city, year, month, volume) %>%
  mutate_text("vol_in_millions", "volume", 1E6) # pass column names as strings
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

Passing the new variable name as character string (indirect)

Instead of passing the name of the variable as a character string as an argument, we can pass a variable containing the name of the variable. In the below example, the name of the new variable is stored in new_var. Using the new {glue} syntax, enabled by the walrus operator :=, we substitute the new_var variable with its value.

mutate_var <- function(df, new_col_var, old_col_var, num = 1) {
  dplyr::mutate(.data = df, "{new_col_var}" := df[[old_col_var]]/num) %>% 
    head()
}

new_var <- "vol_in_millions"
old_var <- "volume"

txhousing %>% 
  select(city, year, month, volume) %>%
  mutate_var(new_var, old_var, 1E6)  # pass column names as variables
# A tibble: 6 x 5
  city     year month   volume vol_in_millions
  <chr>   <int> <int>    <dbl>           <dbl>
1 Abilene  2000     1  5380000            5.38
2 Abilene  2000     2  6505000            6.50
3 Abilene  2000     3  9285000            9.28
4 Abilene  2000     4  9730000            9.73
5 Abilene  2000     5 10590000           10.6 
6 Abilene  2000     6 13910000           13.9 

arrange()

arrange() sorts the rows of a data frame by the values of selected columns. By default, it sorts in Ascending order. To force a column to sort in Descending order, we must use the desc() function.

Passing single raw name

arrange_raw <- function(df, var) {
  dplyr::arrange(.data = df, {{var}}) %>%    # embrace of curly-curly {{}} brackets
    head()
}

arrange_raw(txhousing, sales)
# A tibble: 6 x 9
  city            year month sales volume median listi~1 inven~2  date
  <chr>          <int> <int> <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>
1 San Marcos      2011    10     6 1.16e6 180000     163     8.3 2012.
2 Harlingen       2000     7     9 1.11e6  87500     719    30.8 2000.
3 South Padre I~  2011     1     9 2.09e6 225000    1258    55.7 2011 
4 San Marcos      2011     1    10 1.48e6 140000     165     7.5 2011 
5 San Marcos      2011    12    10 1.56e6 140000     148     8   2012.
6 San Marcos      2014    11    10 1.51e6 146700      96     4   2015.
# ... with abbreviated variable names 1: listings, 2: inventory
arrange_raw(txhousing, desc(sales))
# A tibble: 6 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
2 Houston  2006     6  8628 1795898108 155200    36281       5.6 2006.
3 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
4 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
5 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
6 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.

arrange_raw() fails when we pass multiple raw names as a vector.

arrange_raw(txhousing, c(sales, volume))
Error in `dplyr::arrange()`:
! Problem with the implicit `transmute()` step.
x Problem while computing `..1 = c(sales, volume)`.
x `..1` must be size 8602 or 1, not 17204.

Passing multiple raw names using ... argument

To pass multiple raw names, we must use the ... argument.

arrange_raw_multiple <- function(df, ...) {
  dplyr::arrange(.data = df, ...) %>% 
    head()
}

arrange_raw_multiple(txhousing, city, sales)
# A tibble: 6 x 9
  city     year month sales  volume median listings inventory  date
  <chr>   <int> <int> <dbl>   <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2003     1    68 5385000  70000      668       5.4  2003
2 Abilene  2011     1    68 8834493 123300      809       6.1  2011
3 Abilene  2009     1    70 8414801  92900      861       6.3  2009
4 Abilene  2000     1    72 5380000  71400      701       6.3  2000
5 Abilene  2010     1    73 9130783 112200      868       6.4  2010
6 Abilene  2001     1    75 5730000  64500      779       6.8  2001
arrange_raw_multiple(txhousing, city, desc(sales))
# A tibble: 6 x 9
  city     year month sales   volume median listings inventory  date
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2015     7   268 45845730 148700      986       5   2016.
2 Abilene  2015     6   260 41396230 141500      965       5   2015.
3 Abilene  2007     7   239 29315000 114300      940       5.2 2008.
4 Abilene  2013     8   236 30777727 120000      976       5.4 2014.
5 Abilene  2014     7   231 35861350 145800     1033       5.8 2014.
6 Abilene  2005     6   230 24050000  92500      664       4.1 2005.

Pass single column name as string

arrange_str <- function(df, var, .desc = FALSE) {
  if (.desc) {
    dplyr::arrange(.data = df, desc(df[[var]])) %>% head()
  } else {
    dplyr::arrange(.data = df, df[[var]]) %>% head()
  }
}

arrange_str(txhousing, "sales")
# A tibble: 6 x 9
  city            year month sales volume median listi~1 inven~2  date
  <chr>          <int> <int> <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>
1 San Marcos      2011    10     6 1.16e6 180000     163     8.3 2012.
2 Harlingen       2000     7     9 1.11e6  87500     719    30.8 2000.
3 South Padre I~  2011     1     9 2.09e6 225000    1258    55.7 2011 
4 San Marcos      2011     1    10 1.48e6 140000     165     7.5 2011 
5 San Marcos      2011    12    10 1.56e6 140000     148     8   2012.
6 San Marcos      2014    11    10 1.51e6 146700      96     4   2015.
# ... with abbreviated variable names 1: listings, 2: inventory
arrange_str(txhousing, "sales", .desc = TRUE)
# A tibble: 6 x 9
  city     year month sales     volume median listings inventory  date
  <chr>   <int> <int> <dbl>      <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Houston  2015     7  8945 2568156780 217600    23875       3.4 2016.
2 Houston  2006     6  8628 1795898108 155200    36281       5.6 2006.
3 Houston  2013     7  8468 2168720825 187800    21497       3.3 2014.
4 Houston  2015     6  8449 2490238594 222400    22311       3.2 2015.
5 Houston  2013     5  8439 2121508529 186100    20526       3.3 2013.
6 Houston  2014     6  8391 2342443127 211200    19725       2.9 2014.

Pass multiple column name as string

arrange_str_multiple <- function(df, var, desc = FALSE) {
  if (desc) {
    dplyr::arrange(.data = df, desc(df[var])) %>% head()
  } else {
    dplyr::arrange(.data = df, df[var]) %>% head()
  }
}

# This function arranges the dataframe either all ascending
# or all descending. Definitely need a better example.

arrange_str_multiple(txhousing, c("year", "month", "sales"))
# A tibble: 6 x 9
  city         year month sales  volume median listings invent~1  date
  <chr>       <int> <int> <dbl>   <dbl>  <dbl>    <dbl>    <dbl> <dbl>
1 Paris        2000     1    19 1440000  71700      286      7.5  2000
2 San Marcos   2000     1    22 2380000 106700      190      6.3  2000
3 Lufkin       2000     1    28 2280000  68000       NA     NA    2000
4 Harlingen    2000     1    31 3910000  87500      644     24.9  2000
5 Galveston    2000     1    37 4555000  95000      636      9.1  2000
6 Port Arthur  2000     1    40 3090000  68300      314      5.6  2000
# ... with abbreviated variable name 1: inventory
arrange_str_multiple(txhousing, c("year", "month", "sales"), desc = TRUE)
# A tibble: 6 x 9
  city           year month sales  volume median listi~1 inven~2  date
  <chr>         <int> <int> <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <dbl>
1 Houston        2015     7  8945  2.57e9 217600   23875     3.4 2016.
2 Dallas         2015     7  7038  2.02e9 233000   12292     2.4 2016.
3 Austin         2015     7  3466  1.15e9 264600    7913     3   2016.
4 San Antonio    2015     7  2962  7.05e8 198100    9462     4.1 2016.
5 Collin County  2015     7  1861  6.14e8 292600    2809     2.1 2016.
6 Fort Bend      2015     7  1372  4.32e8 280400    3328     3.1 2016.
# ... with abbreviated variable names 1: listings, 2: inventory

group_by()

In group_by(), we select which columns to, well, group by! (Damn these well-named functions!). So one can use the same techniques as select() to choose the columns.

In the following examples, we will create only one summarised value total_sales for simplicity.

Passing single raw name

group_raw <- function(df, grp) {
  df %>% 
    group_by({{grp}}) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_raw(txhousing, year)        # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw(txhousing, month)       # Sum of sales per month
# A tibble: 5 x 2
  month total_sales
  <int>       <dbl>
1     1      245924
2     2      296410
3     3      386909
4     4      397332
5     5      448968

Passing multiple raw names using the ... operator

group_raw_multiple <- function(df, ...) {
  df %>% 
    group_by(...) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n = 5)
}

group_raw_multiple(txhousing, year)              # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw_multiple(txhousing, year, month)       # Sum of sales per month
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388

Passing single or multiple column names as character string

group_str <- function(df, grp) {
  df %>% 
    group_by(df[grp]) %>% 
    summarise(total_sales = sum(sales, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_str(txhousing, "year")                   # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_str(txhousing, c("year", "month"))       # Sum of sales per month
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388
# The same column names can be passed as variables containing the character names
yr <- "year"
group_str(txhousing, yr)
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
yrmon <- c("year", "month")
group_str(txhousing, yrmon)
# A tibble: 5 x 3
   year month total_sales
  <int> <int>       <dbl>
1  2000     1       11411
2  2000     2       15674
3  2000     3       20202
4  2000     4       18658
5  2000     5       22388

If you want the summarise column to have a custom name like total_<sumvar>, then you can wrap the value in quotes as below. This method uses the glue syntax enabled by the := walrus operator. The walrus operator takes either a raw name or a character string on its LHS.

group_raw2 <- function(df, grp, sumvar) {
  df %>% 
    group_by({{grp}}) %>% 
    summarise("total_{{sumvar}}" := sum({{sumvar}}, na.rm = TRUE),
              .groups = 'drop')  %>% 
    head(n=5)
}

group_raw2(txhousing, year, sales)            # Sum of sales per year
# A tibble: 5 x 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999
group_raw2(txhousing, month, listings)        # Sum of listings per month
# A tibble: 5 x 2
  month total_listings
  <int>          <dbl>
1     1        1854661
2     2        1888104
3     3        1949187
4     4        1991278
5     5        2038932

After writing so many examples, I see a pattern. group_by() works with techniques similar to select() while summarise() works with techniques similar to mutate().

(Slightly Better) Examples

The txhousing is a city-wise monthly sales and volume dataset. It has a year and month column. Let us create a date column and keep only those columns relevant for our custom tx_summary() function.

small_df <- txhousing %>% 
  mutate(date = lubridate::as_date(glue::glue("{year}-{month}-01"))) %>% 
  select(city, date, sales, volume)

mutate() example

Now let us create the create_ymq() function. This function would take 2 arguments, a data frame df and a raw name of a date column.

create_ymq <- function(df, date_col) {
  stopifnot(inherits(df, "data.frame"))
  stopifnot(class(df %>% dplyr::pull({{date_col}})) == 'Date')
  dplyr::mutate(df,
                Year = lubridate::year({{date_col}}),
                nHalf = lubridate::semester({{date_col}}),
                yHalf = lubridate::semester({{date_col}}, with_year = TRUE),
                dHalf = paste0(lubridate::semester({{date_col}}), "H", format({{date_col}},"%y")),
                nQtr = lubridate::quarter({{date_col}}),
                yQtr = lubridate::quarter({{date_col}}, with_year = TRUE),
                dQtr = paste0(lubridate::quarter({{date_col}}),"Q", format({{date_col}},"%y")),
                Month = lubridate::month({{date_col}}),
                yMonth = as.numeric(format({{date_col}}, "%Y.%m")),
                dMonth = format({{date_col}}, "%b %Y")
                )
}

create_ymq(df = small_df, date_col = date) %>% glimpse()
Rows: 8,602
Columns: 14
$ city   <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene"~
$ date   <date> 2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000~
$ sales  <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 9~
$ volume <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 1391000~
$ Year   <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,~
$ nHalf  <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,~
$ yHalf  <dbl> 2000.1, 2000.1, 2000.1, 2000.1, 2000.1, 2000.1, 2000.~
$ dHalf  <chr> "1H00", "1H00", "1H00", "1H00", "1H00", "1H00", "2H00~
$ nQtr   <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2, 2, 2,~
$ yQtr   <dbl> 2000.1, 2000.1, 2000.1, 2000.2, 2000.2, 2000.2, 2000.~
$ dQtr   <chr> "1Q00", "1Q00", "1Q00", "2Q00", "2Q00", "2Q00", "3Q00~
$ Month  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,~
$ yMonth <dbl> 2000.01, 2000.02, 2000.03, 2000.04, 2000.05, 2000.06,~
$ dMonth <chr> "Jan 2000", "Feb 2000", "Mar 2000", "Apr 2000", "May ~

group_by() example

Now that we have a function that creates various date-related columns, let us create a function that let’s you create summary tables like annual sales per city, quarterly volumes per city etc.

tx_summary <- function(df, grp_col, sum_col) {
  df %>% 
    group_by(city, {{grp_col}}) %>% 
    summarise("total_{{sum_col}}" := sum({{sum_col}}, na.rm = TRUE), .groups = 'drop')
}

Using these 2 functions, we can now create multiple summary tables

small_df_with_date_cols <- small_df %>% create_ymq(date_col = date)

# Annual Sales per city
small_df_with_date_cols %>% tx_summary(grp_col = Year, sum_col = sales)
# A tibble: 736 x 3
   city     Year total_sales
   <chr>   <dbl>       <dbl>
 1 Abilene  2000        1375
 2 Abilene  2001        1431
 3 Abilene  2002        1516
 4 Abilene  2003        1632
 5 Abilene  2004        1830
 6 Abilene  2005        1977
 7 Abilene  2006        1997
 8 Abilene  2007        2003
 9 Abilene  2008        1651
10 Abilene  2009        1634
# ... with 726 more rows
# Half Yearly volumes per city
small_df_with_date_cols %>% tx_summary(grp_col = yHalf, sum_col = volume)
# A tibble: 1,472 x 3
   city    yHalf total_volume
   <chr>   <dbl>        <dbl>
 1 Abilene 2000.     55400000
 2 Abilene 2000.     53175000
 3 Abilene 2001.     55795000
 4 Abilene 2001.     58570000
 5 Abilene 2002.     55305000
 6 Abilene 2002.     63370000
 7 Abilene 2003.     58175000
 8 Abilene 2003.     77500000
 9 Abilene 2004.     74205000
10 Abilene 2004.     85465000
# ... with 1,462 more rows
# Quarterly Sales per city
small_df_with_date_cols %>% tx_summary(grp_col = yQtr, sum_col = sales)
# A tibble: 2,898 x 3
   city     yQtr total_sales
   <chr>   <dbl>       <dbl>
 1 Abilene 2000.         300
 2 Abilene 2000.         395
 3 Abilene 2000.         387
 4 Abilene 2000.         293
 5 Abilene 2001.         305
 6 Abilene 2001.         394
 7 Abilene 2001.         401
 8 Abilene 2001.         331
 9 Abilene 2002.         295
10 Abilene 2002.         425
# ... with 2,888 more rows
# Monthly Volumes per city
small_df_with_date_cols %>% tx_summary(grp_col = yMonth, sum_col = volume)
# A tibble: 8,602 x 3
   city    yMonth total_volume
   <chr>    <dbl>        <dbl>
 1 Abilene  2000.      5380000
 2 Abilene  2000.      6505000
 3 Abilene  2000.      9285000
 4 Abilene  2000.      9730000
 5 Abilene  2000.     10590000
 6 Abilene  2000.     13910000
 7 Abilene  2000.     12635000
 8 Abilene  2000.     10710000
 9 Abilene  2000.      7615000
10 Abilene  2000.      7040000
# ... with 8,592 more rows

More ideas

You could further extend this by creating a custom filtering function that gives you, say, the rows with the highest or lowest total_sales or total_volume.

Conclusion

The ability to create such dynamic functions, enabled by the wonderful {dplyr} package, allows us to level-up in terms of programming with R and helps make our code neat and tidy.

How I feel while creating custom functions with {dplyr}! I can almost hear the music! Source: imgur.com

References